05. Data Cleaning Process
The Process
Before any cleaning occurs, make a copy of each piece of data. All of the cleaning operations are then performed on this copy, so the original dirty and/or messy dataset remains available for reference. In pandas, a DataFrame is copied with the copy method. If the original DataFrame is called df, the soon-to-be clean copy could be named df_clean.
df_clean = df.copy()
Note that simply assigning a DataFrame to a new variable name does not create a copy: both names refer to the same object, so modifications made through the new name also change the original DataFrame. This is explained in the answers to this Stack Overflow question: "Why should I make a copy of a DataFrame in pandas?"
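The difference between assigning and copying can be seen in a minimal sketch. The toy DataFrame and the name alias below are hypothetical and are not part of the course dataset.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
df = pd.DataFrame({"zip_code": [92390.0, 2345.0]})

alias = df            # plain assignment: both names refer to the same object
df_clean = df.copy()  # independent copy of the data

# An edit made through the alias shows up in the original...
alias.loc[0, "zip_code"] = 0.0
# ...while an edit made on the copy does not.
df_clean.loc[1, "zip_code"] = 99999.0

print(alias is df)            # True
print(df.loc[0, "zip_code"])  # 0.0 (changed through alias)
print(df.loc[1, "zip_code"])  # 2345.0 (untouched by the edit to df_clean)
```

Working on df_clean keeps df available for comparison after each cleaning operation.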
An Example
Note: a copy of the original dataset was not made before cleaning in the following example, though one should have been.
Quiz
Using the snapshot of the patients table and the output of patients.info() below, answer the following matching quiz.

Snapshot of the patients table

Output of patients.info()
QUIZ QUESTION:
Match each statement below to the appropriate step of the data cleaning process for the zip code issues in the patients_clean table (a copy of the patients table).
ANSWER CHOICES:
Data Cleaning Step | Statement |
---|---|
patients_clean.zip_code.head() | |
Convert the zip code column's data type from a float to a string using | |
Zip code has four digits sometimes | |
Zip code is a float not a string | |
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0') | |
SOLUTION:
Data Cleaning Step | Statement |
---|---|
patients_clean.zip_code.head() | |
Convert the zip code column's data type from a float to a string using | |
patients_clean.zip_code = patients_clean.zip_code.astype(str).str[:-2].str.pad(5, fillchar='0') | |
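
The zip code conversion from the quiz can be tried on a few values to see what each chained call does. This is only a sketch: the three zip codes below are made up for illustration, and missing values are not handled here.

```python
import pandas as pd

# Made-up zip codes stored as floats, as in the quiz scenario.
patients = pd.DataFrame({"zip_code": [92390.0, 2345.0, 61812.0]})
patients_clean = patients.copy()

patients_clean.zip_code = (
    patients_clean.zip_code
    .astype(str)                # 2345.0 becomes "2345.0"
    .str[:-2]                   # drop the trailing ".0"
    .str.pad(5, fillchar="0")   # left-pad to five characters
)

# Check the result.
print(patients_clean.zip_code.head())
print(patients_clean.zip_code.tolist())  # ['92390', '02345', '61812']
```

The slice str[:-2] assumes every value ends in ".0", which holds for whole-number floats like these.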